\documentclass{article}
\usepackage[utf8]{inputenc}

\title{Music Language Model Decoding for AMT}
\author{Adrien Ycart, Andrew McLeod}
\date{January 2018}

\usepackage[round]{natbib}
\usepackage{graphicx}
\usepackage{a4wide}
\usepackage{hyperref}
\usepackage{amsmath}

\begin{document}

\maketitle

\section{Introduction}

The goal of this project is to investigate language-model decoding in the context of AMT.
The general idea is to use a symbolic model of music to assess the likelihood of candidate solutions, obtained via an acoustic model.

This process is different from \citep{ycart2018polyphonic}.
In that paper, an LSTM was trained to convert acoustic model output to binary piano-rolls.
The model was trained on pairs of outputs of a specific acoustic model (obtained from audio files) and MIDI transcriptions of the audio file.
It required aligned pairs of audio recordings and MIDI transcriptions.
It was also trained for one specific acoustic model.

In the current study, the language model is trained on symbolic data only.
It can also be used with any acoustic model (unless some acoustic-model-specific fine-tuning is done on the language model, see \ref{sec:schedsamp})
It is similar to what it done in \citep{sigtia2016end}.
The main difference is that we will investigate the effect of various time-steps on performance:
 \cite{sigtia2016end} used a 32ms timestep (see Section IV.C from their paper), which we argue is not suited for music language modelling (see \citep{Ycart2017} and \citep{Korzeniowski2017}).
The language model will also have a different architecture: 
\citep{sigtia2016end} used the RNN-RBM \citep{Boulanger-Lewandowski2012}, while we will use a simple LSTM, similar to \citep{Ycart2017}.

A similar language-model decoding was also presented in \citep{app8030470} for AMT.
They used the RNN-RBM as language model, but using an "event" timestep (i.e. one timestep per new onset, without considering the duration of each note).
It also echoes \citep{Korzeniowski2018} in the context of chord recognition.
We also aim to evaluate the performance using this kind of timestep.

\section{Models}

\subsection{Acoustic model}

As acoustic model, we will use \citep{Kelz2016}.
It is a state-of-the-art model, only recently surpassed by \citep{Hawthorne2018}.
It also uses the same architecture as in \citep{sigtia2016end}, with only slight modifications.
It outputs pitch likelihoods as independent Bernouilli variables.

\subsection{Language model}

As language model, we will use a simple LSTM, as described in \citep{Ycart2017}.
The main argument for using it is to determine whether the interesting qualitative properties displayed by the 16th note model on symbolic prediction can bring some kind of improvement in the context of AMT.


\section{General workflow}
\label{sec:workflow}

Our system will operate on acoustic model outputs, in the form of a $88\times T$ matrix $M$,
with real values between 0 and 1.
It will output a matrix of same size, with binary values.

The system will find the best global solution by keeping a heap of candidate binary pianorolls
for timesteps $0,..,t$.
At each iteration, it will update this heap of candidates with possible continuations at timestep $t+1$. We will use beam search to make finding the solution tractable, with a beam of size $B$, saving only the $B$ most likely candidates at each timestep.
At timestep $T-1$, it will select the candidate with highest likelihood as the global solution.

More specifically, at each timestep $t$, the general algorithm is the following:

\begin{enumerate}
\item Get $K$ binary continuation for timestep $t$ ($K$ is the branching factor)
\item For each candidate in the heap, compute the likelihood of the candidate concatenated with the continuation
\item Update the heap based with these continued candidates, and discard those outside of the beam
\end{enumerate}

\section{Datasets}

The acoustic model is trained on the MAPS dataset \citep{emiya2010multipitch}.

The language model will be trained using different kinds of data.
Unquantised (i.e. performance) MIDI data will be obtained from the International Piano-e-Competition\footnote{\url{http://piano-e-competition.com/}},
and from the Piano-Midi.de dataset\footnote{\url{http://piano-midi.de/}}.
Quantised data will be taken from Boulanger-Lewandowski's datasets\footnote{\url{http://www-etud.iro.umontreal.ca/~boulanni/icml2012}},
from the Piano-Midi.de dataset (more recent version than the one used by Boulanger-Lewandowski),
as well as from MIDI data scraped from various websites.
The scraped data will be checked to make sure it corresponds to piano solo music, and that it is indeed quantised.
We will also make sure that it doesn't contain pieces that are in the MAPS test set.

For testing and for training in the proof-of-concept experiment (see \ref{sec:POC}), the A-MAPS \citep{ycart2018maps} rhythm annotations will be used.

\section{Experiments}

\subsection{Preliminary experiment}

We will run a quick experiment to show that a transduction model is fitted to one specific acoustic model, and cannot be used on a different one.
To do so, we will train the model from \citep{ycart2018polyphonic} using outputs from Kelz's model \citep{Kelz2016}.
We will then test it with another acoustic model, probably just a slightly modified version of  \citep{Kelz2016} (for instance with fewer layers).
We expect to see no improvement, maybe even a decrease in performance from using LSTM transduction decoding in the mismatched configuration, despite using only slightly different acoustic model.

\subsection{Proof-of-concept experiment}
\label{sec:POC}

We will compare 3 kinds of timesteps for the language model decoding:

\begin{enumerate}
\item \emph{time}: 40ms timesteps, similar to \citep{sigtia2016end} (Kelz's uses a framerate of 25 fps).
\item \emph{quant}:16th-note timesteps, as recommended in \citep{Ycart2017}
\item \emph{event}: event timesteps, as in \citep{app8030470}
\end{enumerate}

In this first experiment, we will use rhythm ground truth annotations from A-MAPS for 16th note timesteps, and ground-truth onset annotations for event timesteps.
The output of the acoustic model will be downsampled to the correct timestep using the \emph{step}
strategy described in \citep{ycart2018polyphonic}.

To compare all 3 timesteps, results will have to be converted back to 40ms timesteps.
For fair comparison, we will also evaluate the performance of 40ms timesteps and event timesteps using 16th-note quantisation as a post processing step.

\subsection{Real-life setting experiment}

We will again compare the same 3 kinds of timesteps, but without using ground truth annotations.
For 16th note timesteps, beat annotations will be obtained with a beat-tracking algorithm (probably taken from the madmom library \citep{madmom}).
For event timesteps, onset annotations will be obtained with an onset detection algorithm (probably also taken from madmom).

Discussion of the results will include an investigation on the correlation between success of the pre-processing method (beat tracking and onset detection) and the performance of the system, fo each test file.
One problem is that when using ground-truth annotations, we will use real 16th note timesteps, as written on the score, while in the real-life setting, we will use a fourth of a beat (which is not the same thing in the case of ternary meter).
We will also investigate how the performance of the system varies depending of the time signature of the test pieces (the time signatures are available in A-MAPS).

We would likely leave the use of probabilistic onset or pitch detections for future work.

\section{Challenges}

For each of the steps outlined in section \ref{sec:workflow}, we outline some challenges and ideas to overcome them.

\subsection{Language model training}
\label{sec:focalloss}

We have advocated for the use of 16th note timesteps in \citep{Ycart2017}, as shorter, constant-length timesteps seemed to only learn to repeat the previous active note.
However, 16th note timesteps present the disadvantage of requiring beat annotations.
On the other hand, there exist strategies to force networks to put emphasis during training on what they don't correctly detect, namely the focal loss \citep{lin2018focal}.
The idea is to weight the training loss for each bin by a factor that is greater when the network doesn't detect correctly the bin or is uncertain, and lower when the bin is correctly detected.
This simple strategy might help 40ms timestep models focus on transitions, that they incorrectly predict, rather than on continued notes.
We will investigate whether it actually does.

\subsection{Get continuations at timestep $t$}
\label{sec:sampling}

\subsubsection{Strategies}

At each timestep, we will have to get a set of binary continuations.
A naive strategy would be to sample binary vectors from the acoustic model independent pitch-wise distributions.

Given a vector of independent Bernouilli variables, \citep{Boulanger-Lewandowski2013} gives an algorithm to enumerate binary vectors in decreasing order of likelihood.
We will use this method.
We still have to define which likelihoods to take.

We identify 3 main strategies:

\begin{enumerate}
\item Sample from acoustic model distributions only
\item Sample from the product of the acoustic model and language model distributions
\item Sample from acoustic model on one side, and from the language model on the other, and use the union of these two sets of continuations
\end{enumerate}

In the first case, a note not detected by the acoustic model cannot be present in the output; the language model decoding can only eliminate false positives.
Also, the set of continuations will be the same for each candidate sequence.
In the two last cases, the set of continuation will depend on the language model, and thus on the candidate sequences, which might represent a potentially significant extra computational cost.

Experiments will include a comparison of these strategies.

\subsubsection{Evaluating these strategies}

To have an idea of the potential success of each of these strategies, we could compute at each time step the rank of the ground-truth in the list of most likely continuations.
If this rank is lower than the branching factor, the correct solution will not be evaluated, so it is not possible to obtain it.
A more computationally efficient method would be to set a branching factor (e.g. 20), and check the proportion of frames for which the ground truth is in the first 20 most likely samples.
We could then plot this proportion for various branching factor values (e.g. 10, 20, 50, 100, 200).

This is easy to do when only the acoustic model is considered.
When the language model is also taken into account, the samples chosen at times $1 ... t$ influence the distribution at time $t+1$.
One solution would be to use the ground truth sequence, to obtain an upper bound for performance.
Another solution, more realistic but more computationally expensive, is to evaluate it for each sequence in the beam at each timestep.



\subsection{Compute the likelihood of the extended sequences}
\label{sec:schedsamp}

The likelihood will be obtained as the product of acoustic model and language model likelihoods (potentially with a weighting factor between the two).

One problem that might arise is that the MLM is trained on perfect sequences, and tested on imperfect, sampled sequences.
As a result, the MLM might give very bad predictions.
One way to overcome that is to add noise to the input sequences when training the MLM, to make it robust to noise.

One naive solution is to add noise by randomly sampling false negatives (at some rate) and false positives (from a flat distribution).
For a more sophisticated solution, we could implement something similar to scheduled sampling  \citep{Bengio2015}.
The idea is to replace during training some timesteps with solutions sampled from some probability distribution.
At the beginning of training, we always use perfect sequences.
Throughout training, we gradually incorporate more and more sampled timesteps.

More precisely, let $X$ be the perfect sequence, and $\hat{x}$ the sequence of prediction probability distributions
($\hat{x_t}$ is predicted from $X_{1...t-1}$).
At each timestep $t$, to choose the input to the network $x_t$, we toss a coin,
and choose $X_t$ with probability $p$, or $sample(\hat{x_t})$ with probability $1-p$, where $sample$ corresponds to drawing a binary sample from the distribution.
We start training with $p=1$, and gradually decrease $p$ during training.



\subsection{Update the heap}
\label{sec:update}

The heap of candidates is likely to get saturated with lots of near-duplicates: sequences that vary only by a few timesteps.
One solution to favour variety in the candidates is to use hashed beam search (used by \citep{Korzeniowski2018}):
from all candidates that are identical in the previous $M$ frames, we only keep the one that has the highest likelihood, and discard the others.


\section{Evaluation}

\subsection{Baseline}

We will compare our system to various baselines:

\begin{itemize}
\item Simple threshold at $0.5$
\item Threshold + median filtering
\item HMM smoothing \citep{Poliner2006}
\end{itemize}

When using \emph{quant} and \emph{event}, the output of our system will be converted back to \emph{time} timesteps (40ms).
For each baseline system, we will also report the performance of the system using 16th-note quantisation as a post processing step, for fair comparison with \emph{quant} timesteps.

\subsection{Benchmark evaluation}

We will evaluate the performance of our system with frame-wise and note-wise F-measure, as usually done in AMT.
To compare various processing timesteps, all results will be converted back to 40ms timesteps.

\subsubsection{Frame-wise F-mesure}

F-measure is computed by making a direct binary comparison between the output and ground-truth piano-rolls, following the MIREX Multiple-F0 Estimation task \citep{Bay2009}, 

For each file, we compute:

\[
P = \frac{\sum_{t=0}^{T}TP(t)}{\sum_{t=0}^{T}TP(t) - FP(t)} ,
R = \frac{\sum_{t=0}^{T}TP(t)}{\sum_{t=0}^{T}TP(t) - FN(t)} ,
F = \frac{2 \cdot P \cdot R}{P + R} 
\]
where $TP(t)$, $FP(t)$ and $FN(t)$ are the number of true positives, false positives and false negatives at time step $t$, respectively. 
Then these measures are averaged over the test set.

\subsubsection{Note-wise F-mesure}

The piano-rolls are first converted to a list of notes.
Given a piano-roll matrix $P$, a note $(p,s,t)$ with $p$ the MIDI pitch, $s$ its start time and $e$ its end time must fulfil the following conditions:

\begin{itemize}
\item $P[p,s-1] = 0$ and $P[p,s] = 1$
\item $P[p,e-1] = 1$ and $P[p,e] = 0$
\end{itemize}

A note is correctly detected if its pitch is correct and its onset is within 50ms of the corresponding ground truth onset. We don't take into account offset.
We then compute the precision, recall and F-measure scores using the \texttt{mir\_eval} implementation \citep{raffel2014mireval}.

\subsection{Further evaluation}

We will consider using new metrics, that we define in sections \ref{sec:semitone} and \ref{sec:outkey}.
We might also conduct some very basic listening tests, with a tentative protocol defined in section \ref{sec:listening}.
 
\subsubsection{Semitone errors}
\label{sec:semitone} 


Particularly common mistakes in AMT are semitone errors.
We define a frame-wise semitone error as follows: given $E$ and $T$ the estimated and target piano-rolls, $p$ a pitch and $t$ a time-step, $E[p,t]$ is a semitone error if and only if: 
\vspace{-0.2cm}
\begin{align*}
&E[p,t]= 1 \wedge T[p,t]= 0 \wedge
(T[p-1,t]= 1 \vee T[p+1,t]= 1)
\end{align*}

\vspace{-0.2cm}
We then compute two different ratios: let $N_s$ be the number of semitone errors, $N_f$ the number of frames, and $N_e$ the total number of false positives:
\vspace{-0.2cm}
\[
\mathcal{S}_{f,t} = \frac{N_s}{N_f} \textrm{ and } \mathcal{S}_{f,p} = \frac{N_s}{N_e}
\]

\vspace{-0.2cm}
The first is correlated to $\mathcal{P}$, while the second can be biased when $\mathcal{P}$ is very high (an output with only 1 error that happens to be a semitone errors will have $\mathcal{S}_{f,p}=1$)
%
We do not define a note-wise semitone error due to the potential confusions with passing notes within a tolerance range.
%This issue with passing notes has less influence with frame-wise metrics, as only part of a note can be considered as a semitone error.

\subsubsection{In-key and Out-of-key Errors}
\label{sec:outkey}

We assume that out-of-key spurious notes are particularly salient, and should be avoided. On the other hand, in-key errors, although they should still be avoided, are less problematic. A note is considered out of key if it is both outside of the current key, and not in the ground truth, such that we do not penalise passing notes, or short modulations.

We define the proportion of in-key and out-of-key errors both on the frame-wise and note-wise level.
For frame-wise out-of-key errors, the implementation is straightforward: given $k(n)$ a key template with tonic note $n$, $n(t)$ the tonic at time-step $t$,
$E[p,t]$ is an out-of-key error if and only if :
\vspace{-0.2cm}
\[
E[p,t] = 1 \wedge T[p,t] = 0 \wedge p \notin k(n(t))
\]

\vspace{-0.2cm}
For note-wise metrics, we consider that a note is in-key if its pitch is in the key at the time of its onset. We include a tolerance threshold $t_k$ around key changes: $n=(p,s,e)$ is an out-of-key error
if and only if: 
\vspace{-0.2cm}
\setlength{\jot}{0pt}
\begin{align*}
& n \text{ is false pos.} \wedge (p \in k(n(s)) \vee p \in k(n(s + t_k)))
\end{align*}

\vspace{-0.2cm}
We do not allow tolerance after a key change (when $p \in k(n(s - t_k))$), as in this case, the whole duration of the note is in a different key. We compute once again two different ratios both for frame-wise and note-wise metrics:  
let $N_i$  be the number of in-key errors, $N_f$ the total number of frames (total number of notes in ground truth for note-wise metrics), $N_e$ the total number of false positives:

\vspace{-0.35cm}
\[
\mathcal{I}_{\_,t} = \frac{N_i}{N_f} \textrm{ and } 
\mathcal{I}_{\_,p} = \frac{N_i}{N_e}
\]

\vspace{-0.10cm}
Out-of-key errors $\mathcal{O}_{\_,t}$ and $\mathcal{O}_{\_,p}$ are defined analogously.
Again, the first ratios are correlated to $\mathcal{P}$, while the second can have strong variations when $\mathcal{P}$ is high. 
We also have the relation $\mathcal{I}_{\_,p} + \mathcal{O}_{\_,p} = 1$. We thus only report $\mathcal{O}_{\_,p}$.

To determine if a note is in-key or out-of-key, key annotations are needed.
In the present experiments, we rely on key annotations of the A-MAPS database. 
However, they do not take into account the short modulations that can appear for a few bars in a piece; only the changes in the key signature of the original score of a piece are reported.
Moreover, all the keys are written as the major relative, even when the scale is minor (for instance, a piece in A minor will be annotated as C major).
The key annotations thus are not perfect, but we consider that they can still bring some useful perspective on the kind of errors made.

\subsubsection{Listening tests}
\label{sec:listening}

Defining a perception-based evaluation metric is a whole other, ongoing project, that hopefully will yield results quick enough.
In the meantime, we can define the following protocol:

\begin{itemize}
\item Randomly choose a piece from the test set
\item Randomly choose 2 model among: Kelz's model, 40ms MLM decoding, 16th-note MLM decoding, event MLM decoding.
\item Present pairs of outputs of about 15 seconds corresponding to the piece drawn, evaluated with the 2 model drawn.
\item Ask which one is more pleasant (e.g. first model/second model/equally pleasant)
\item Present the corresponding ground truth 
\item Ask which one is closer to ground truth (e.g. first model/second model/equally close)
\end{itemize}

We could also collect information about music expertise of participants (Gold-MSI).
This test could be rather quick to do, if we ask participants to only rate 10 to 20 excerpts.
We could also conduct this test online, as we do not consider timbre to be an important parameter
to consider for quality of the output, making it quite easy to get a enough participants.

Asking separately about pleasantness and closeness to ground-truth might help us highlight the "musicality" of our model, in the case where the output might be musical, but different from the target.

\section{Publication plans}

\begin{enumerate}
\item For ISMIR 2019: 
	\begin{enumerate}
	\item Proof-of-concept experiments
	\item Challenges \ref{sec:sampling}, \ref{sec:update}
	\item Evaluation over the test dataset, and for each test piece
	\end{enumerate}

\item For later journal paper:
	\begin{enumerate}
	\item Contents of ISMIR paper
	\item Preliminary experiment
	\item Real-life experiments
	\item Challenges \ref{sec:focalloss}, \ref{sec:schedsamp},
	\end{enumerate}
\end{enumerate}

\bibliographystyle{plainnat}
\bibliography{references}
\end{document}
